Developed by Uma Sivakumar - 834006815
We are modelling the demand for shared bikes using the available independent variables. Bike-sharing companies can use such a model to understand how demand varies with different features, adjust their business strategy to meet demand levels, and satisfy customer expectations. The model is also a useful way for management to understand the demand dynamics of a new market.
By leveraging the rich dataset provided by bike-sharing systems, the research aims to provide a comprehensive understanding of urban mobility patterns. This knowledge can inform city planning, transportation policies, and infrastructure development, ultimately leading to more sustainable and efficient urban environments. Furthermore, the project's findings may have implications for addressing broader issues related to smart city initiatives and the integration of data-driven solutions into urban planning.
Bike sharing systems represent an innovative evolution of traditional bike rentals, automating the entire process from membership acquisition to bike rental and return. Users can seamlessly rent a bike from one location and return it to another, marking a departure from conventional rental systems. Currently, there are over 500 bike-sharing programs globally, collectively offering more than 500,000 bicycles. These systems have garnered significant interest due to their pivotal role in addressing traffic congestion, environmental concerns, and public health.
Beyond their practical applications, bike-sharing systems generate data with distinctive characteristics that make them intriguing for research. Unlike other transportation services like buses or subways, bike-sharing systems explicitly record travel duration, departure, and arrival positions. This unique feature transforms bike-sharing systems into virtual sensor networks capable of sensing mobility patterns within a city. Consequently, the project anticipates that important urban events can be detected by monitoring this data.
Source of Information : Kaggle - https://www.kaggle.com/datasets/lakshmi25npathi/bike-sharing-dataset
Attribute Information
We will be using the day.csv file, which has the following fields:
day.csv - bike sharing counts aggregated on a daily basis.
Preprocessing steps that will be done in the course of the project:
1. Handling Missing Values, Outliers, and Data Quality Issues:
Outliers: Detect and address outliers. This might involve visual inspection using box plots or statistical methods like the IQR (Interquartile Range) to filter out extreme values.
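The IQR filter described above can be sketched as follows (toy data; in the project this is applied to the actual bikeShare dataframe):

```python
import pandas as pd

def iqr_filter(df: pd.DataFrame, col: str, k: float = 1.5) -> pd.DataFrame:
    """Keep only rows whose `col` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[col] >= lower) & (df[col] <= upper)]

# Toy example: the value 100.0 falls far outside the bounds and is removed
toy = pd.DataFrame({"humidity": [0.4, 0.5, 0.55, 0.6, 0.62, 100.0]})
filtered = iqr_filter(toy, "humidity")
```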
Analyzing outliers for attribute temp using the boxplot visualization.
We can observe that there is no outlier in the variable temp.
Analyzing outliers for attribute temp_feel using the boxplot visualization.
We can observe that there is no outlier in the variable temp_feel.
Analyzing outliers for attribute humidity using the boxplot visualization.
We have identified outliers in the "humidity" attribute using the Interquartile Range (IQR) statistical method. The analysis revealed 2 outliers. To address this, we established lower and upper bounds, calculated as Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, respectively. Subsequently, we filtered the dataset to retain only the data points with humidity values falling within these bounds, effectively removing the identified outliers.
The humidity distribution, post the removal of outliers using the calculated bounds, is depicted below:
Analyzing outliers for attribute windspeed using the boxplot visualization.
We have identified outliers in the "windspeed" attribute using the Interquartile Range (IQR) statistical method. The analysis revealed 2 outliers. To address this, we established lower and upper bounds, calculated as Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, respectively. Subsequently, we filtered the dataset to retain only the data points with windspeed values falling within these bounds, effectively removing the identified outliers.
The windspeed distribution, post the removal of outliers using the calculated bounds, is depicted below:
Analyzing outliers for attribute casual using the boxplot visualization.
We have identified outliers in the "casual" attribute using the Interquartile Range (IQR) statistical method. The analysis revealed 2 outliers. To address this, we established lower and upper bounds, calculated as Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, respectively. Subsequently, we filtered the dataset to retain only the data points with casual values falling within these bounds, effectively removing the identified outliers.
The casual users distribution, post the removal of outliers using the calculated bounds, is depicted below:
Analyzing outliers for attribute registered using the boxplot visualization.
We can observe that there is no outlier in the variable registered.
Data Quality Issues: Check for any inconsistencies or errors in the data. This could include renaming columns, correcting typos, addressing duplicate entries, or ensuring consistency in categorical variables.
Renamed the columns for better understanding and readability.
Ensuring consistency in categorical variables (by converting their 0/1 codes to meaningful categories).
2. Building a Common Dataset:
Ensure that all relevant data is integrated into a unified dataset. This may involve merging datasets, handling different formats, or addressing data inconsistencies.
We later remove temp_feel (atemp) from the working dataset after observing a high positive correlation between temp and temp_feel, which indicates multicollinearity.
The sub-dataset that will be used for regression is depicted below (head shown for a visual check):
3. Transforming Variables:
Normalization (scaling): If variables have different scales, normalization can be applied to bring them to a similar scale. Normalization methods that could be used are Min-Max scaling or Z-score normalization.
We have used Min-Max scaler to normalize or rescale the data here.
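A minimal sketch of Min-Max scaling with scikit-learn (toy values; the scaler is fit on the training split only and then reused on the test split to avoid leakage):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[10.0], [20.0], [30.0], [40.0]])
X_test = np.array([[25.0]])

scaler = MinMaxScaler()                         # rescales each column to [0, 1]
X_train_scaled = scaler.fit_transform(X_train)  # (x - min) / (max - min)
X_test_scaled = scaler.transform(X_test)        # reuses the training min/max
```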
Sample of the training data after normalization:
Sample of the test data after normalization:
Encoding Categorical Variables: Convert categorical variables into a numerical format suitable for machine learning models. This might involve one-hot encoding, or other methods depending on the nature of the categorical data.
Here, we are using One-hot encoding to convert categorical variables into binary columns.
With this, the number of columns increases from 13 to 31.
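One-hot encoding can be sketched with pandas (toy frame; the column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"season": ["spring", "summer", "spring", "winter"],
                   "count": [120, 250, 130, 90]})

# Each category becomes a binary column; drop_first=True avoids the
# dummy-variable trap (perfect collinearity) in linear models
encoded = pd.get_dummies(df, columns=["season"], drop_first=True)
```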
4. Performing Exploratory Data Analysis (EDA):
Visualizing Data: Use various plots and charts to visualize the data. This would include histograms for distribution analysis, scatter plots to explore relationships, and box plots to identify outliers.
Univariate Analysis
Examining the skewness and overall distribution of continuous features through the visualization of histograms and kernel density estimation.
Observations :
Examining the overall distribution of categorical features with respect to target variable count through the visualization of barplots.
Observations :
Bivariate Analysis
Visualizing the relationship between features and the target variable while considering the distinction based on weather situation.
Observations :
Visualizing the relationship between features and the target variable while considering the distinction based on working days.
Observations :
Bike usage on working days heavily dominates that on non-working days.
Visualizing the relationship between features and the target variable while considering the data for the years 2011 and 2012.
Observations :
Demand increased from 2011 to 2012.
Finding Correlations: Explore relationships between variables by calculating correlation coefficients. This can help identify which variables are strongly or weakly correlated.
Multivariate Analysis
Plotting a pairplot to find correlations between features.
Observations :
Plotting a heatmap to find correlations between features.
Observations :
Statistical Methods to test hypothesis on the data:
I have implemented Pearson's and Spearman's rank correlation for the statistical modelling. Pearson's correlation coefficient is a valuable statistical measure for assessing the strength and direction of a linear relationship between two continuous variables. Ranging from -1 to 1, a positive coefficient indicates a positive correlation, while a negative coefficient signifies a negative correlation. Values close to 1 or -1 suggest a strong linear association, whereas those close to 0 indicate a weak correlation.
In exploring relationships between variables, researchers may consider alternative measures like Spearman's rank correlation or Kendall's tau, which are more robust to nonlinear associations and less influenced by outliers. These considerations contribute to a comprehensive analysis, enhancing the reliability and interpretability of findings in research endeavors.
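The difference between the two measures can be demonstrated on a strictly monotone but nonlinear relationship (synthetic data):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.exp(3 * x)          # strictly increasing, but clearly nonlinear

r, _ = pearsonr(x, y)      # penalized by the curvature
rho, _ = spearmanr(x, y)   # ranks agree perfectly, so rho is 1
```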
Hypothesis-1
Weather Impact Hypothesis:
Hypothesis-2
Working Day Influence Hypothesis:
Alternative Hypothesis: The number of bike rentals varies significantly between working days and non-working days.
Hypothesis-3
Holiday Effect Hypothesis:
Alternative Hypothesis: Bike rentals experience a change in demand during holidays.
Hypothesis-4
Temperature Impact Hypothesis:
Alternative Hypothesis: Bike rentals are influenced by temperature, with specific temperature ranges associated with higher or lower demand.
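As a sketch of how the temperature hypothesis can be tested with Spearman's rank correlation (the mini dataframe below is hypothetical, standing in for the real bikeShare data):

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical sample: normalized temperature vs. daily rental counts
sample = pd.DataFrame({"temp":  [0.2, 0.3, 0.45, 0.6, 0.7, 0.8],
                       "count": [800, 1500, 3000, 4200, 5200, 4900]})

rho, p_value = spearmanr(sample["temp"], sample["count"])
alpha = 0.05
reject_null = p_value < alpha   # True -> evidence that temperature matters
```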
The regression models I chose for predicting the demand for shared bikes are Multiple Linear Regression, Lasso Regression, Ridge Regression and Elastic Net Regression.
Quantile, Poisson, Negative Binomial, Cox, Partial Least Squares, and PCA regression models are deemed unsuitable for this dataset. Quantile regression is typically employed for predicting conditional quantiles, whereas here the objective is to predict the mean daily rental count. Poisson and Negative Binomial regressions are designed for counts of relatively rare events; the daily rental totals here are large and approximately continuous, so linear models are a reasonable fit. Zero-inflated regression is designed for datasets with a significant number of zero values, which is not characteristic of this dataset. Cox regression, designed for survival analysis, does not align with the project's hypothesis and objectives. Additionally, PCA regression is intended for reducing high-dimensional feature spaces, which is unnecessary here: since the project focuses on predicting a single target from a modest number of features, PCA does not contribute to the assumptions or goals of the analysis.
Reasons for choosing these regressors are as follows :
1. Standard Linear Regression: Standard linear regression is a straightforward model that assumes a linear relationship between the features and the target variable. If the relationship between the features (e.g., temperature, humidity, windspeed) and the bike rental count ('count') is approximately linear, a standard linear regression model can provide interpretable coefficients.
2. Ridge Regression: Ridge regression is useful when there is multicollinearity among the features. If features such as temperature and 'feels-like' temperature ('temp_feel') are highly correlated, Ridge regression can help mitigate multicollinearity by adding a regularization term to the cost function.
3. Lasso Regression: Lasso regression is beneficial when feature selection is desired. If there are many features, and some of them may not be significantly contributing to the prediction of bike rentals, Lasso can automatically shrink the coefficients of less important features to zero.
4. Elastic Net Regression: Elastic Net regression combines the advantages of Ridge and Lasso, making it suitable when there is a mix of correlated features and potential feature sparsity. It provides a balance between Ridge and Lasso regularization.
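A compact sketch of fitting and comparing the four regressors on synthetic data (the true coefficients and alphas below are made up purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in for the preprocessed bike-share features
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
models = {"linear": LinearRegression(),
          "ridge": Ridge(alpha=1.0),
          "lasso": Lasso(alpha=0.1),
          "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5)}
scores = {name: r2_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
```

Note how Lasso tends to push the two truly-zero coefficients toward exactly zero, which is the feature-selection behaviour described above.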
Hypothesis-5
We implemented four regression models: Multiple Linear Regression, Lasso Regression, Elastic Net Regression, and Ridge Regression. The models were instantiated, trained on the training data, and used to predict the target variable ('count'). Model evaluation was then conducted using metrics such as Mean Squared Error, Root Mean Squared Error, Mean Absolute Error, and the R^2 score.
For further refinement, hyperparameter tuning was performed using GridSearchCV for Elastic Net Regression, LassoCV for Lasso Regression, and RidgeCV for Ridge Regression. The models were then instantiated with the optimized hyperparameters, trained on the training data, and used to predict the target variable.
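The tuning step can be sketched as follows (synthetic data; the alpha grids are illustrative):

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNet
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=150)

# LassoCV and RidgeCV search their alpha grids with built-in cross-validation
lasso = LassoCV(alphas=np.logspace(-3, 1, 20), cv=5).fit(X, y)
ridge = RidgeCV(alphas=np.logspace(-3, 1, 20)).fit(X, y)

# ElasticNet has two hyperparameters, so a grid search is used instead
grid = GridSearchCV(ElasticNet(max_iter=10_000),
                    {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]},
                    cv=5).fit(X, y)
```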
Comparisons between models were visualized by creating a dataframe summarizing error and accuracy rates in a tabular form.
Additionally, a line graph was generated to provide a visual representation of the different errors and accuracies across various models. The results demonstrated improved performance with hyperparameter tuning compared to the initial model configurations.
Comparing all four models (Linear, Lasso, Ridge, ElasticNet), the MSE and RMSE values for linear regression are the lowest, making it the better predictor here. The R-squared value for linear regression is also the highest, indicating that this model best explains the variation of the output with the different inputs (i.e., it generalizes the data well).
Therefore, linear regression is the optimal model here.
This graph shows the predicted counts against the actual values. The predictions track the actual values almost perfectly; note that a near-perfect fit is expected here, because casual and registered, which sum exactly to count, are among the predictors.
The coefficients and intercept of the model are :
The coefficients and intercept play crucial roles in understanding the relationship between the independent variables (features) and the dependent variable (target). Here's what they imply:
Intercept (bias) b0: the intercept represents the predicted value of the dependent variable when all independent variables are set to zero. It often has no practical interpretation unless zero is a meaningful value for the variables involved.
Coefficients (slopes) b1, b2, ..., bn: each coefficient represents the change in the predicted value of the dependent variable for a one-unit change in the corresponding independent variable, holding all other variables constant.
In summary, the intercept and coefficients provide insights into the baseline value and the impact of each independent variable on the predicted outcome. They allow you to interpret how changes in the input variables contribute to changes in the predicted output in a linear manner.
The linear regression equation is represented as follows:
Y = b0 + b1*X1 + b2*X2 + ... + bn*Xn
Y is the predicted value of the dependent variable.
b0 is the intercept.
b1,b2,...,bn are the coefficients corresponding to independent variables
X1,X2,...,Xn are the independent variables.
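Numerically, the equation is just a dot product plus the intercept; the numbers below are hypothetical, chosen only to illustrate the arithmetic:

```python
import numpy as np

b0 = 1500.0                      # intercept
b = np.array([4000.0, -800.0])   # hypothetical coefficients (e.g. temp, windspeed)
x = np.array([0.6, 0.2])         # one normalized observation

y_pred = b0 + b @ x              # Y = b0 + b1*X1 + b2*X2
# 1500 + 4000*0.6 - 800*0.2 = 3740.0
```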
The Bike Share dataset provides a rich source of information for understanding urban mobility patterns and predicting bike rentals based on various features. During the exploratory data analysis (EDA) and regression modelling phases, several hypotheses were tested and insights were gained:
Impact of Weather Conditions: Hypothesis: Weather conditions have a significant impact on bike rentals. Findings: The dataset revealed that certain weather situations, such as clear days, may influence bike rentals differently.
Working Day vs. Non-Working Day Comparison: Hypothesis: There are differences in bike rentals on working days and non-working days. Findings: Statistical tests, such as t-tests, were performed to compare bike rentals on different days, providing insights into the rental patterns. Working-day status does affect shared-bike demand.
Holiday Effect Analysis: Hypothesis: Holidays have an impact on bike rentals. Findings: Statistical tests were conducted to compare bike rentals on holidays and regular days, uncovering trends and patterns around holiday periods. After the t-test we can confidently say Yes, holidays do impact bike rentals.
Temperature Impact Hypothesis: Hypothesis: Temperature has an effect on bike rentals. Findings: Spearman's rank correlation coefficient was used to evaluate the monotonic relationship between temperature and bike rentals, indicating a significant correlation. This supports the claim that temperature affects the demand for shared bikes.
Linear Regression Modeling: Hypothesis: Linear regression models can predict bike rentals based on various features. Findings: Multiple linear regression, Lasso regression, Elastic Net regression, and Ridge regression models were implemented and evaluated using metrics such as Mean Squared Error, Root Mean Squared Error, Mean Absolute Error, and R^2 score. Hyperparameter tuning improved model performance. After which we came to a conclusion that linear regression is the most optimal model.
In conclusion, the Bike Share dataset allows for the modeling and prediction of bike rentals, with weather conditions, working days, holidays, and temperature playing significant roles. The implemented regression models provide a valuable tool for understanding and predicting bike rental patterns, enabling better management and strategy development for bike-sharing companies.
In summary, this study employed regression modeling to predict bike rental counts using a comprehensive bike share dataset. The exploratory data analysis (EDA) phase provided valuable insights into the trends, patterns, and relationships within the dataset. Multiple regression models, including linear regression, lasso regression, elastic net regression, and ridge regression, were implemented and evaluated for predicting bike rentals. The models underwent hyperparameter tuning, leading to enhanced performance, as evidenced by metrics such as Mean Squared Error, Root Mean Squared Error, Mean Absolute Error, and R^2 score.
This study contributes reliable models for predicting bike rentals, shedding light on factors influencing demand. The findings may have practical applications for bike-sharing companies and urban planning efforts. However, it's essential to acknowledge the dataset's limitations, such as its one-year timeframe and the inherent unpredictability associated with human behavior.
Limitations:
Temporal Scope: The dataset may have a limited temporal scope, covering a specific time period. This could limit the generalizability of the findings to different seasons, years, or changing trends.
Geographical Scope: The dataset may be specific to a certain city or region, and the patterns observed may not necessarily apply to other locations with different characteristics.
Population Representativeness: The dataset may not be fully representative of the entire population using bike-sharing services. For example, if certain demographics are underrepresented in the dataset, the analysis might not capture the preferences or behaviors of those groups.
Future work:
Time Series Analysis: Explore more advanced time series analysis techniques to capture temporal patterns and trends in bike rentals. This could involve using methods like ARIMA, SARIMA, or even deep learning models tailored for time series data.
Predictive Modeling Improvement: Experiment with more advanced predictive models beyond linear regression, such as decision trees, random forests, gradient boosting, or neural networks. Evaluate their performance and compare them with the linear regression model.
User Segmentation: Segment users based on their rental patterns, and analyze each segment separately. This could provide insights into the different behaviors and preferences of user groups, helping tailor marketing or operational strategies.
Customer Behavior Analysis: Conduct a detailed analysis of customer behavior, such as preferred routes, popular pickup/drop-off locations, and user demographics. This information can guide marketing efforts and service improvements.
Integration with Weather Data: Incorporate more detailed weather data to enhance the model's accuracy. Weather conditions beyond the basic categories (clear, mist, etc.) may provide more nuanced insights into how weather impacts bike rentals.
Long-Term Trends and Seasonal Analysis: Analyze long-term trends in bike rentals and conduct a more in-depth seasonal analysis to understand how demand varies across different seasons and years.
# pearsonr ships with SciPy (scipy.stats), not as a standalone package;
# the correct pip name for sklearn is scikit-learn
!pip install scipy scikit-learn
# Importing the libraries required
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind
from scipy.stats import f_oneway
from scipy.stats import spearmanr
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.datasets import make_regression
from scipy.stats import uniform
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Reading the data set (day.csv holds the daily aggregates)
bikeShare = pd.read_csv("day.csv")
bikeShare.head()
bikeShare.shape
bikeShare.info()
bikeShare.describe()
# Checking for Null Values
bikeShare.isnull().sum()
# Checking for unique values
bikeShare.nunique()
# Renaming columns for better readability
bikeShare.rename(columns={'yr':'year','mnth':'month','weathersit':'weather_situation','atemp':'temp_feel',
'hum':'humidity','cnt':'count'}, inplace=True)
bikeShare.head()
season_codes = {1:'spring', 2:'summer', 3:'fall', 4:'winter'}
bikeShare['season'] = bikeShare['season'].map(season_codes)
bikeShare.season.head()
month_codes = {1:'January', 2:'February', 3:'March', 4:'April', 5:'May', 6:'June', 7:'July', 8:'August', 9:'September', 10:'October', 11:'November', 12:'December'}
bikeShare['month'] = bikeShare['month'].map(month_codes)
bikeShare.month.head()
# In this dataset, weekday 0 corresponds to Sunday
weekday_codes = {0:'Sunday', 1:'Monday', 2:'Tuesday', 3:'Wednesday', 4:'Thursday', 5:'Friday', 6:'Saturday'}
bikeShare['weekday'] = bikeShare['weekday'].map(weekday_codes)
bikeShare.weekday.head()
weathersit_codes = {1:'Clear', 2:'Mist', 3:'Light Snow', 4:'Heavy Rain'}
bikeShare['weather_situation'] = bikeShare['weather_situation'].map(weathersit_codes)
bikeShare.weather_situation.head()
yr_codes = {0:"2011",1:"2012"}
bikeShare['year'] = bikeShare['year'].map(yr_codes)
bikeShare.year.head()
bikeShare.head()
# Keeping only the relevant columns (this drops instant and dteday)
bikeShare = bikeShare[['season', 'year', 'month', 'holiday', 'weekday', 'workingday', 'weather_situation', 'temp', 'temp_feel', 'humidity', 'windspeed', 'casual', 'registered', 'count']]
bikeShare.head()
bikeShare.info()
# Outlier Detection
# Plot the boxplot of temp variable.
plt.figure(figsize = [8,2])
sns.boxplot(bikeShare.temp)
plt.title("Distribution of temperature", fontsize = 12, color = "brown")
plt.show()
# Plot the boxplot of temp_feel variable.
plt.figure(figsize = [8,2])
sns.boxplot(bikeShare.temp_feel)
plt.title("Distribution of temperature feeling", fontsize = 12, color = "brown")
plt.show()
# Plot the boxplot of humidity variable.
plt.figure(figsize = [8,2])
sns.boxplot(bikeShare.humidity)
plt.title("Distribution of humidity", fontsize = 12, color = "brown")
plt.show()
# Calculate the IQR (Interquartile Range)
Q1 = bikeShare.humidity.quantile(0.25)
Q3 = bikeShare.humidity.quantile(0.75)
IQR = Q3 - Q1
# Define a threshold for outliers
threshold = 1.5 * IQR
# Identify and mark outliers
bikeShare['Is_Outlier'] = (bikeShare.humidity < (Q1 - threshold)) | (bikeShare.humidity > (Q3 + threshold))
# Print the outliers
outliers = bikeShare[bikeShare['Is_Outlier']]
print("\033[1mNumber of outliers:\033[0m ", outliers.shape[0])
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
bikeShare = bikeShare[(bikeShare.humidity >= lower_bound) & (bikeShare.humidity <= upper_bound)]
bikeShare.drop('Is_Outlier', inplace = True, axis = 1)
bikeShare.head()
# Checking for shape
bikeShare.shape
# Plot the boxplot of humidity variable after the removal of outliers.
plt.figure(figsize = [8,2])
sns.boxplot(bikeShare.humidity)
plt.title("Distribution of humidity", fontsize = 12, color = "brown")
plt.show()
# Plot the boxplot of windspeed variable.
plt.figure(figsize = [8,2])
sns.boxplot(bikeShare.windspeed)
plt.title("Distribution of windspeed", fontsize = 12, color = "brown")
plt.show()
# Calculate the IQR (Interquartile Range)
Q1 = bikeShare.windspeed.quantile(0.25)
Q3 = bikeShare.windspeed.quantile(0.75)
IQR = Q3 - Q1
# Define a threshold for outliers
threshold = 1.5 * IQR
# Identify and mark outliers
bikeShare['Is_Outlier'] = (bikeShare.windspeed < (Q1 - threshold)) | (bikeShare.windspeed > (Q3 + threshold))
# Print the outliers
outliers = bikeShare[bikeShare['Is_Outlier']]
print("\033[1mNumber of outliers:\033[0m ", outliers.shape[0])
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
bikeShare = bikeShare[(bikeShare.windspeed >= lower_bound) & (bikeShare.windspeed <= upper_bound)]
bikeShare.drop('Is_Outlier', inplace = True, axis = 1)
bikeShare.head()
# Checking for shape
bikeShare.shape
# Plot the boxplot of windspeed variable after the removal of outliers.
plt.figure(figsize = [8,2])
sns.boxplot(bikeShare.windspeed)
plt.title("Distribution of windspeed", fontsize = 12, color = "brown")
plt.show()
# Plot the boxplot of casual variable.
plt.figure(figsize = [8,2])
sns.boxplot(bikeShare.casual)
plt.title("Distribution of casual users", fontsize = 12, color = "brown")
plt.show()
# Calculate the IQR (Interquartile Range)
Q1 = bikeShare.casual.quantile(0.25)
Q3 = bikeShare.casual.quantile(0.75)
IQR = Q3 - Q1
# Define a threshold for outliers
threshold = 1.5 * IQR
# Identify and mark outliers
bikeShare['Is_Outlier'] = (bikeShare.casual < (Q1 - threshold)) | (bikeShare.casual > (Q3 + threshold))
# Print the outliers
outliers = bikeShare[bikeShare['Is_Outlier']]
print("\033[1mNumber of outliers:\033[0m ", outliers.shape[0])
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
bikeShare = bikeShare[(bikeShare.casual >= lower_bound) & (bikeShare.casual <= upper_bound)]
bikeShare.drop('Is_Outlier', inplace = True, axis = 1)
bikeShare.head()
# Checking for shape
bikeShare.shape
# Plot the boxplot of casual variable.
plt.figure(figsize = [8,2])
sns.boxplot(bikeShare.casual)
plt.title("Distribution of casual users", fontsize = 12, color = "brown")
plt.show()

# Plot the boxplot of registered variable.
plt.figure(figsize = [8,2])
sns.boxplot(bikeShare.registered)
plt.title("Distribution of registered users", fontsize = 12, color = "brown")
plt.show()
# Creating a df for numerical values
num_var = bikeShare[['temp', 'temp_feel', 'humidity', 'windspeed', 'casual', 'registered']]
# Creating a df for categories
cat_var = bikeShare[['season', 'year', 'month', 'holiday', 'weekday', 'workingday', 'weather_situation', 'count']]
# Exploring numerical columns of the bikeShare dataframe
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(18, 12))
axes = axes.flatten()
for i, col in enumerate(num_var.columns):
    sns.histplot(num_var[col], stat='density', kde=True, kde_kws={"cut": 3}, ax=axes[i])
plt.suptitle('Histograms depicting the distribution of Numerical variables', fontsize=16, color='Green')
plt.show()
# Exploring categorical columns of the bikeShare dataframe
for col in cat_var:
    print("\033[1m" + col + "\033[0m")
    print(bikeShare[col].value_counts())
    print("\n")
# Plotting the categorical variables
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(14, 14))
# Flatten the axes for easier iteration
axes = axes.flatten()
# Loop through each categorical column and create a bar plot
for i, column in enumerate(cat_var.columns[:-1]):  # exclude 'count' itself
    sns.barplot(x=column, y='count', data=cat_var, ax=axes[i])
    axes[i].set_title(f'{column} vs. count')
    axes[i].set_xlabel(column)
    axes[i].set_ylabel('count')
    axes[i].tick_params(axis='x', rotation=45)
# Remove the empty subplots (if any)
if len(cat_var.columns[:-1]) < len(axes):
    for j in range(len(cat_var.columns[:-1]), len(axes)):
        fig.delaxes(axes[j])
# Adjust layout
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.suptitle('Bar Plots for Categorical Variables Against Target Variable count', fontsize=16, color="Green")
plt.show()
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(14, 14))
axes = axes.flatten()
for i, col in enumerate(num_var.columns):
    sns.scatterplot(data=bikeShare, x=col, y='count', hue=bikeShare['weather_situation'], ax=axes[i])
plt.suptitle('Relationship between features and the target variable w.r.t weather situation', fontsize=16, color="Green", y=0.9)
plt.show()
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(14, 14))
axes = axes.flatten()
for i, col in enumerate(num_var.columns):
    sns.scatterplot(data=bikeShare, x=col, y='count', hue=bikeShare['workingday'], ax=axes[i], palette='inferno')
plt.suptitle('Relationship between features and the target variable w.r.t working day', fontsize=16, color="Green", y=0.9)
plt.show()
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(14, 14))
axes = axes.flatten()
for i, col in enumerate(num_var.columns):
    sns.scatterplot(data=bikeShare, x=col, y='count', hue=bikeShare['year'], ax=axes[i], palette='magma')
plt.suptitle('Relationship between features and the target variable w.r.t year', fontsize=16, color="Green", y=0.9)
plt.show()
# Visualizing numerical variables - pairplot
num_var1 = bikeShare[['temp', 'temp_feel', 'humidity', 'windspeed', 'casual', 'registered', 'count']]
sns.pairplot(num_var1)
plt.suptitle('Pairplot to find correlation between features', fontsize=20, color="Green", y=1.02)
plt.show()
# Let's check the correlation coefficients to see which variables are highly correlated
plt.figure(figsize = (8, 6))
sns.heatmap(num_var1.corr(), annot = True, cmap="YlGnBu")
plt.title('Heatmap to find correlation between features', fontsize=12, color="Green", y=1.02)
plt.show()
# Removing temp_feel.
bikeShare.drop(['temp_feel'],axis=1,inplace=True)
# Statistical Modelling
# Split data into three groups based on weather conditions
group_clear = bikeShare[bikeShare['weather_situation'] == 'Clear']['count']
group_mist = bikeShare[bikeShare['weather_situation'] == 'Mist']['count']
group_light_snow = bikeShare[bikeShare['weather_situation'] == 'Light Snow']['count']
# Perform one-way ANOVA
statistic, p_value = f_oneway(group_clear, group_mist, group_light_snow)
# Define significance level
alpha = 0.05
# Print the results
print(f'\033[1mANOVA Statistic:\033[0m {statistic}')
print(f'\033[1mP-value:\033[0m {p_value}')
# Compare p-value with significance level
if p_value < alpha:
    print('\033[1mReject the null hypothesis: Different weather conditions affect bike rentals differently.\033[0m')
else:
    print('\033[1mFail to reject the null hypothesis: No significant impact of weather conditions on bike rentals.\033[0m')
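As a hedged, self-contained sanity check of the same `f_oneway` call (synthetic data, not the `bikeShare` dataframe), three groups with deliberately separated means should produce a p-value far below 0.05:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(42)
# Three synthetic "weather" groups with clearly different mean counts
clear = rng.normal(loc=5000, scale=500, size=100)
mist = rng.normal(loc=4000, scale=500, size=100)
snow = rng.normal(loc=2000, scale=500, size=100)

f_stat, f_p = f_oneway(clear, mist, snow)
print(f"F = {f_stat:.1f}, p = {f_p:.3g}")  # p is far below 0.05 here
```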
# Visualize the distribution of rentals on working and non-working days
plt.figure(figsize=(10, 6))
sns.histplot(data=bikeShare, x='count', hue='weather_situation', bins=30, kde=True)
plt.title('Distribution of Bike Rentals on different weather situations', color='green')
plt.xlabel('Count of Bike Rentals')
plt.ylabel('Frequency')
plt.legend(title='Weather situation', labels=['Clear', 'Mist', 'Light Snow'])
plt.show()
# Split data into two groups: working days and non-working days
working_days = bikeShare[bikeShare['workingday'] == 1]['count']
non_working_days = bikeShare[bikeShare['workingday'] == 0]['count']
# Perform independent t-test
statistic, p_value = ttest_ind(working_days, non_working_days)
# Define significance level
alpha = 0.05
# Print the results
print(f'\033[1mT-test Statistic:\033[0m {statistic}')
print(f'\033[1mP-value:\033[0m {p_value}')
# Compare p-value with significance level
if p_value < alpha:
    print('\033[1mReject the null hypothesis: There are significant differences in bike rentals between working and non-working days.\033[0m')
else:
    print('\033[1mFail to reject the null hypothesis: No significant differences in bike rentals between working and non-working days.\033[0m')
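Note that `ttest_ind` assumes equal group variances by default. A hedged sketch of Welch's variant (`equal_var=False`) on synthetic data, which is often preferable when the two groups differ in size and spread:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
working = rng.normal(loc=4500, scale=800, size=500)       # synthetic working-day counts
non_working = rng.normal(loc=4200, scale=1500, size=230)  # smaller, noisier group

# Welch's t-test does not pool the variances of the two groups
t_welch, p_welch = ttest_ind(working, non_working, equal_var=False)
print(f"Welch t = {t_welch:.2f}, p = {p_welch:.3g}")
```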
# Visualize the distribution of rentals on working and non-working days
plt.figure(figsize=(10, 6))
sns.histplot(data=bikeShare, x='count', hue='workingday', bins=30, kde=True)
plt.title('Distribution of Bike Rentals on Working and Non-Working Days', color='green')
plt.xlabel('Count of Bike Rentals')
plt.ylabel('Frequency')
plt.legend(title='Working Day', labels=['Non-Working Day', 'Working Day'])
plt.show()
# Split data into two groups: holidays and regular days
holidays = bikeShare[bikeShare['holiday'] == 1]['count']
regular_days = bikeShare[bikeShare['holiday'] == 0]['count']
# Perform independent t-test
statistic, p_value = ttest_ind(holidays, regular_days)
# Define significance level
alpha = 0.05
# Print the results
print(f'\033[1mT-test Statistic:\033[0m {statistic}')
print(f'\033[1mP-value:\033[0m {p_value}')
# Compare p-value with significance level
if p_value < alpha:
    print('\033[1mReject the null hypothesis: There are significant differences in bike rentals between holidays and regular days.\033[0m')
else:
    print('\033[1mFail to reject the null hypothesis: No significant differences in bike rentals between holidays and regular days.\033[0m')
# Visualize the distribution of rentals on holidays and regular days
plt.figure(figsize=(10, 6))
sns.histplot(data=bikeShare, x='count', hue='holiday', bins=30, kde=True)
plt.title('Distribution of Bike Rentals on Holidays and Regular Days', color='green')
plt.xlabel('Count of Bike Rentals')
plt.ylabel('Frequency')
plt.legend(title='Holiday', labels=['Regular Day', 'Holiday'])
plt.show()
# Test the correlation between temperature and bike rentals using Spearman's rank correlation
correlation, p_value = spearmanr(bikeShare['temp'], bikeShare['count'])
# Define significance level
alpha = 0.05
# Print the results
print(f'\033[1mSpearman\'s Rank Correlation:\033[0m {correlation}')
print(f'\033[1mP-value:\033[0m {p_value}')
# Compare p-value with significance level
if p_value < alpha:
    print('\033[1mReject the null hypothesis: There is a significant correlation between temperature and bike rentals.\033[0m')
else:
    print('\033[1mFail to reject the null hypothesis: No significant correlation between temperature and bike rentals.\033[0m')
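Spearman's correlation is rank-based, which is why it suits this test: it scores any monotonic relationship, linear or not. A minimal sketch with synthetic values:

```python
import numpy as np
from scipy.stats import spearmanr

x = np.arange(1, 21)
y = x ** 3  # non-linear but strictly increasing
rho, rho_p = spearmanr(x, y)
print(rho)  # 1.0 — ranks match exactly even though the relation is not linear
```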
# Visualize the relationship between temperature and bike rentals
plt.figure(figsize=(10, 6))
sns.scatterplot(data=bikeShare, x='temp', y='count')
plt.title('Relationship Between Temperature and Bike Rentals', color='green')
plt.xlabel('Temperature')
plt.ylabel('Count of Bike Rentals')
plt.show()
# Creating dummies
status = pd.get_dummies(bikeShare[['season', 'year', 'month', 'weekday', 'weather_situation']], drop_first=True)
status.head()
# Concatenate the dummy dataframe with the original bikeShare dataframe
bikeShare_new = pd.concat([bikeShare, status], axis=1)
# Checking the concatenated dataframe
bikeShare_new.head()
# Drop 'season', 'year', 'month', 'weekday', 'weather_situation' now that dummies exist for them
drop_cols = ['season', 'year', 'month', 'weekday', 'weather_situation']
bikeShare_new.drop(drop_cols, axis=1, inplace=True)
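A toy illustration (hypothetical values, not the real `season` column) of what `drop_first=True` does: a k-level category becomes k-1 indicator columns, dropping the first level alphabetically to avoid the dummy-variable trap:

```python
import pandas as pd

toy = pd.DataFrame({'season': ['spring', 'summer', 'fall', 'spring']})
dummies = pd.get_dummies(toy, drop_first=True)
# 'fall' sorts first alphabetically, so it is the dropped baseline level
print(dummies.columns.tolist())  # ['season_spring', 'season_summer']
```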
from sklearn.model_selection import train_test_split
# We specify random_state so that the train and test data set always have the same rows, respectively
df_train, df_test = train_test_split(bikeShare_new, train_size = 0.7, test_size = 0.3, random_state = 100)
scaler = MinMaxScaler()
# Apply the scaler to the numeric columns only (not the binary/dummy variables);
# fit on the training set, then reuse the fitted scaler on the test set
num_vars = ['temp', 'humidity', 'windspeed', 'casual', 'registered']
df_train[num_vars] = scaler.fit_transform(df_train[num_vars])
df_test[num_vars] = scaler.transform(df_test[num_vars])
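The fit-on-train / transform-on-test pattern above matters: fitting the scaler on all the data would leak the test set's minima and maxima into training. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])  # outside the training range

sc = MinMaxScaler()
train_scaled = sc.fit_transform(train)  # maps 10 -> 0.0 and 30 -> 1.0
test_scaled = sc.transform(test)        # uses the *train* min/max only
print(test_scaled)  # [[1.5]] — test values beyond the training range exceed [0, 1]
```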
# Let's check the correlation coefficients to see which variables are highly correlated
plt.figure(figsize = (25, 20))
sns.heatmap(df_train.corr(), annot = True, cmap="YlGnBu")
plt.show()
X_train = df_train.drop('count', axis=1)
y_train = df_train['count']
X_test = df_test.drop('count', axis=1)
y_test = df_test['count']
# Linear Regression
# Create regression model instances and fit them on the training data
# Linear Regression Model
print("\033[1mLinear Regression Model\033[0m")
lr = LinearRegression()
lr.fit(X_train,y_train)
y_pred_linear = lr.predict(X_test)
print('Predicted Linear Regression values are ', y_pred_linear[1:5])
mse = mean_squared_error(y_test,y_pred_linear)
print(f"Mean Squared Error: {mse}")
rmse = mse**0.5
print(f"Root Mean Squared Error: {rmse}")
mae = mean_absolute_error(y_test,y_pred_linear)
print(f"Mean Absolute Error: {mae}")
r2score=r2_score(y_test,y_pred_linear)
print(f"R^2 Score: {r2score}")
print("\n")
# Lasso Regression Model
print("\033[1mLasso Regression Model\033[0m")
lasso = Lasso(alpha=0.5)
lasso.fit(X_train,y_train)
y_pred_lasso = lasso.predict(X_test)
print('Predicted Lasso Regression values are ', y_pred_lasso[1:5])
mse = mean_squared_error(y_test,y_pred_lasso)
print(f"Mean Squared Error: {mse}")
rmse = mse**0.5
print(f"Root Mean Squared Error: {rmse}")
mae = mean_absolute_error(y_test,y_pred_lasso)
print(f"Mean Absolute Error: {mae}")
r2score=r2_score(y_test,y_pred_lasso)
print(f"R^2 Score: {r2score}")
print("\n")
# Ridge Regression Model
print("\033[1mRidge Regression Model\033[0m")
ridge = Ridge(alpha=1)
ridge.fit(X_train,y_train)
y_pred_ridge=ridge.predict(X_test)
print('Predicted Ridge Regression values are ', y_pred_ridge[1:5])
mse = mean_squared_error(y_test,y_pred_ridge)
print(f"Mean Squared Error: {mse}")
rmse = mse**0.5
print(f"Root Mean Squared Error: {rmse}")
mae = mean_absolute_error(y_test,y_pred_ridge)
print(f"Mean Absolute Error: {mae}")
r2score=r2_score(y_test,y_pred_ridge)
print(f"R^2 Score: {r2score}")
print("\n")
# Elastic Net Regression Model
print("\033[1mElasticNet Regression Model\033[0m")
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X_train,y_train)
y_pred_elastic = elastic_net.predict(X_test)
print('Predicted ElasticNet Regression values are ', y_pred_elastic[1:5])
mse = mean_squared_error(y_test,y_pred_elastic)
print(f"Mean Squared Error: {mse}")
rmse = mse**0.5
print(f"Root Mean Squared Error: {rmse}")
mae = mean_absolute_error(y_test,y_pred_elastic)
print(f"Mean Absolute Error: {mae}")
r2score=r2_score(y_test,y_pred_elastic)
print(f"R^2 Score: {r2score}")
# Hyperparameter tuning
# ElasticNet Regression
# Define the grid of hyperparameters 'param_grid'
# (note: alpha=0 makes ElasticNet plain OLS; scikit-learn emits a warning for it)
param_grid = {
    'alpha': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1],
    'l1_ratio': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
}
# Initialize GridSearchCV with the required parameters
grid_model_result = GridSearchCV(estimator=ElasticNet(),
param_grid=param_grid,
cv=10).fit(X_train,y_train)
# Print results
print(f"Best: {grid_model_result.best_score_} using {grid_model_result.best_params_}")
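One caveat worth knowing: for regressors, `GridSearchCV.best_score_` is the mean cross-validated R² (the estimator's default scorer), not an error metric. A self-contained sketch of the same search pattern on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem, independent of the bike-share data
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
grid = GridSearchCV(estimator=ElasticNet(max_iter=10000),
                    param_grid={'alpha': [0.01, 0.1, 1.0],
                                'l1_ratio': [0.2, 0.5, 0.8]},
                    cv=5).fit(X_syn, y_syn)
print(grid.best_params_, round(grid.best_score_, 3))  # best_score_ is a mean R^2
```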
# Ridge Regression
# Defining list of alphas
alphas = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
# Create a ridge regressor object that does cross-validation
ridge_cv = RidgeCV(alphas=alphas)
# Fit it to our training data
ridge_cv.fit(X_train,y_train)
# Get predictions for test set.
y_pred_ridgecv=ridge_cv.predict(X_test)
print('Ridge CV Model Mean Squared Error:', mean_squared_error(y_test,y_pred_ridgecv))
print('Best Alpha after Cross Validation :', ridge_cv.alpha_)
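Worth noting: when no `cv` argument is given, `RidgeCV` uses an efficient leave-one-out cross-validation by default. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic problem; alphas mirror the list-of-candidates pattern used above
Xr, yr = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=1)
ridge_loo = RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(Xr, yr)
print(ridge_loo.alpha_)  # the alpha that minimised the leave-one-out error
```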
# Create regression model instances with the tuned hyperparameters and fit them on the training data
# Linear Regression Model
lr = LinearRegression()
lr.fit(X_train,y_train)
y_pred_linear = lr.predict(X_test)
print("\033[1mLinear Regression Model\033[0m")
print('Predicted Linear Regression values are ', y_pred_linear[1:5])
# Lasso Regression Model
# (alpha=0 reduces Lasso to OLS; scikit-learn advises LinearRegression instead and may warn)
lasso = Lasso(alpha=0)
lasso.fit(X_train,y_train)
y_pred_lasso = lasso.predict(X_test)
print("\n")
print("\033[1mLasso Regression Model\033[0m")
print('Predicted Lasso Regression values are ', y_pred_lasso[1:5])
# Ridge Regression Model
ridge = Ridge(alpha=0.1)
ridge.fit(X_train,y_train)
y_pred_ridge=ridge.predict(X_test)
print("\n")
print("\033[1mRidge Regression Model\033[0m")
print('Predicted Ridge Regression values are ', y_pred_ridge[1:5])
# Elastic Net Regression Model
elastic_net = ElasticNet(alpha=0.1, l1_ratio=1)
elastic_net.fit(X_train,y_train)
y_pred_elastic = elastic_net.predict(X_test)
print("\n")
print("\033[1mElasticNet Regression Model\033[0m")
print('Predicted ElasticNet Regression values are ', y_pred_elastic[1:5])
# Calculate evaluation metrics for each model
model_names = ["Linear Regression", "Lasso Regression", "Ridge Regression", "Elastic Net Regression"]
models = [lr, lasso, ridge, elastic_net]
predictions = [y_pred_linear, y_pred_lasso, y_pred_ridge, y_pred_elastic]
metrics = []
for name, model, y_pred in zip(model_names, models, predictions):
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    metrics.append([name, mse, rmse, mae, r2])
# Create a DataFrame to display the metrics
df_metrics = pd.DataFrame(metrics, columns=["Model", "Mean Squared Error (MSE)", "Root Mean Square Error (RMSE)", "Mean Absolute Error (MAE)", "R-squared (R^2)"])
# Display the table
print(df_metrics)
# Use the model names as the index so each metric column can be plotted directly
df_metrics.set_index("Model", inplace=True)
# Custom x-axis labels
custom_labels = ["Linear", "Lasso", "Ridge", "Elastic Net"]
# Plotting the metrics
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 10))
fig.suptitle('Model Evaluation Metrics')
# Plotting Mean Squared Error (MSE)
df_metrics['Mean Squared Error (MSE)'].plot(kind='line', marker='o', ax=axes[0, 0], grid=True)
axes[0, 0].set_ylabel('MSE')
axes[0, 0].set_xticks(range(len(custom_labels)))
axes[0, 0].set_xticklabels(custom_labels)
# Plotting Root Mean Square Error (RMSE)
df_metrics['Root Mean Square Error (RMSE)'].plot(kind='line', marker='o', ax=axes[0, 1], grid=True)
axes[0, 1].set_ylabel('RMSE')
axes[0, 1].set_xticks(range(len(custom_labels)))
axes[0, 1].set_xticklabels(custom_labels)
# Plotting Mean Absolute Error (MAE)
df_metrics['Mean Absolute Error (MAE)'].plot(kind='line', marker='o', ax=axes[1, 0], grid=True)
axes[1, 0].set_ylabel('MAE')
axes[1, 0].set_xticks(range(len(custom_labels)))
axes[1, 0].set_xticklabels(custom_labels)
# Plotting R-squared (R^2)
df_metrics['R-squared (R^2)'].plot(kind='line', marker='o', ax=axes[1, 1], grid=True)
axes[1, 1].set_ylabel('R^2')
axes[1, 1].set_xticks(range(len(custom_labels)))
axes[1, 1].set_xticklabels(custom_labels)
plt.tight_layout(rect=[0, 0, 1, 0.96]) # Adjust the layout
plt.show()
# Plotting Regression Graph
plt.scatter(y_test, y_pred_linear)
plt.xlabel("Actual Count")
plt.ylabel("Predicted Count")
plt.title("Linear Regression: Actual vs. Predicted Count")
plt.show()
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(18, 12))
axes = axes.flatten()
numerical_cols = ['temp', 'humidity', 'windspeed', 'casual', 'registered']
# Plotting individual regression plots for each numerical variable
for i, column in enumerate(numerical_cols):
    sns.regplot(x=X_test[column], y=y_test, scatter_kws={'s': 15}, line_kws={'color': 'red'}, ax=axes[i])
    axes[i].set_title(f"Linear Regression: count vs. {column}")
    axes[i].set_xlabel(column)
    axes[i].set_ylabel("Count")
# Remove any empty subplots
if len(numerical_cols) < len(axes.flat):
    for j in range(len(numerical_cols), len(axes.flat)):
        fig.delaxes(axes.flat[j])
plt.suptitle('Linear Regression: Actual vs. Predicted Count for Numerical Variables', fontsize=16, color='Green')
plt.show()
# Coefficients for each feature
coefficients = lr.coef_
feature_names = X_train.columns
coef_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})
sorted_coef_df = coef_df.sort_values(by='Coefficient', ascending=False)
# Intercept
intercept = lr.intercept_
print("\033[1mCoefficients:\033[0m")
print(sorted_coef_df)
print("\n")
print("\033[1mIntercept:\033[0m", intercept)